BUDT704 Project: Group 22

Scope of coding in Olympics 2040

Group Codelympics: Mahir Dave, Parth Nanwani, Priyanka Chib, Smriti Gangwani, Vishwa Thakkar, Zaid Dar

Introduction:

The Olympic Games are a major international multi-sport event. The Summer and Winter Games are each held once every four years; they took place in the same year through 1992 and have alternated every two years since 1994. There are over 400 events across 40 sports in total: the Summer Olympics consist of 33 sports and the Winter Olympics of 7. Each event generates data about the competition and its medal winners. We analyze data from multiple Olympic Games in Python, using different visualizations to draw inferences and find correlations. We combined the Olympics dataset with the National Olympic Committee (NOC) dataset to create a dataframe of countries and total medals won.

The project uses a dataset covering the Olympic Games from 1896 to 2016. The dataset includes biometric statistics of the athletes, such as height, weight and age, and event-related statistics such as sport, medals won and country. It helps us analyze how the Olympics have evolved across sports and events, and compare the performance of different countries and of participants of different genders.

We used two datasets that originate from www.sports-reference.com and that we acquired from kaggle.com. The first dataset (‘athlete_events.csv’) contains, for the years 1896 to 2016, the name of each athlete, their physical characteristics (age, sex, weight, height), NOC (country code), assigned team, type of games, games participated in, season, event and medal. The column NOC in this dataset specifies the National Olympic Committee. The second dataset (‘noc_regions.csv’) maps each NOC value to a region. To get the region for each athlete, we therefore merged the two datasets on this common column, which lets us determine which athletes compete for which region. We also used web scraping on https://statisticstimes.com/economy/projected-world-gdp-ranking.php to obtain the GDP of the top countries.
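The merge described above can be sketched roughly as follows. This is a minimal illustration on toy rows, not the project's actual code: the column names `NOC`, `region` and `Medal` are assumed to match the CSV headers, and the code "XXX" is made up to show what an unmapped NOC looks like.

```python
import pandas as pd

# Toy stand-ins for 'athlete_events.csv' and 'noc_regions.csv' (illustrative values only).
athletes = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "NOC": ["USA", "CAN", "XXX"],   # "XXX" is a made-up code with no region entry
    "Medal": ["Gold", None, "Bronze"],
})
noc_regions = pd.DataFrame({
    "NOC": ["USA", "CAN"],
    "region": ["USA", "Canada"],
})

# A left join keeps every athlete row, even when its NOC code has no match.
merged = athletes.merge(noc_regions, on="NOC", how="left")

# Rows whose NOC could not be mapped end up with a missing region.
unmatched = merged[merged["region"].isna()]
print(merged)
print("Unmapped NOC codes:", unmatched["NOC"].tolist())
```

A left join is used here so that athletes with an unmapped NOC are not silently dropped; inspecting `unmatched` shows which codes would need manual handling.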

Questions of Interest:

These questions will help us determine which factors play a vital role in the performance of a country or athlete at the Olympics, while showing us trends and interesting details that we can use to make inferences about the various fields of data. The analysis combines factors such as BMI, GDP and even politics to determine what drives winning for athletes from different regions. It not only identifies the factors associated with the best outcomes but can also suggest ways to improve performance. Hence, for this project we will mainly focus on data analysis.

Data Cleaning:

The acquired data had a lot of noise. For example, the column Team specifies the team an athlete competes for in a particular game. Some values in this column consist of several parts, of which only the first is meaningful for processing: an athlete on team China could have the value China-1, where the only relevant information is the team name, China. To clean this column, we split each value and keep only the first element, which is the part that is useful for analysis and processing.
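A minimal sketch of this cleaning step is shown below, assuming the separator is a hyphen (as in the China-1 example) and that the column is named `Team`. The sample values are illustrative.

```python
import pandas as pd

teams = pd.DataFrame({
    "Team": ["China-1", "China", "United States-2", "Denmark/Sweden"],
})

# Split on the hyphen and keep only the first piece, i.e. the team name.
teams["Team"] = teams["Team"].str.split("-").str[0].str.strip()
print(teams["Team"].tolist())
```

One caveat of this approach: a team name that legitimately contains a hyphen would be truncated as well, so it may be worth checking the distinct values before and after the split.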

The columns for the athletes' biometrics (age, weight and height) had many missing values. Because these columns are central to the analysis, we could not simply drop the rows with missing values. Instead, to fill the NaN values, we grouped the data by Sport and Gender. This grouping gives us the biometrics of the athletes in each sport, separated by gender. A missing value such as an athlete's weight is likely to be close to the mean weight of all athletes competing in the same sport. However, the mean weight in a sport can differ between genders, which is why both columns were used for grouping. We then computed the mean of the required biometric within each group and used it to replace the NaN values.

Even after assigning the group means for weight, height and age, some NaN values remained: in some sports only one athlete participated, and that athlete's value was missing. In such cases, grouping by Sport and Gender changes nothing, because the group contains no known value to average. Many rows were of this kind. For these cases, we assigned the overall mean of the required biometric after grouping by Gender alone.
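The two-stage fill described in this section could look roughly like the sketch below. It assumes the gender column is named `Sex` (as in the field list of the athlete dataset) and shows only `Weight`; the same pattern would apply to height and age. The rows are toy data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sport":  ["Judo", "Judo", "Judo", "Judo", "Aeronautics"],
    "Sex":    ["M", "M", "F", "F", "M"],
    "Weight": [80.0, np.nan, 60.0, np.nan, np.nan],
})

# Stage 1: fill each gap with the mean of its (Sport, Sex) group.
group_mean = df.groupby(["Sport", "Sex"])["Weight"].transform("mean")
df["Weight"] = df["Weight"].fillna(group_mean)

# Stage 2: a group with no known value (here the single Aeronautics row)
# is still NaN, so fall back to the mean over all athletes of that sex.
sex_mean = df.groupby("Sex")["Weight"].transform("mean")
df["Weight"] = df["Weight"].fillna(sex_mean)

print(df)
```

Note that in this sketch the fallback mean is computed after stage 1, so it includes the values imputed there; computing it from the original observations only is an equally defensible choice.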

Impact of geographical location on Winter and Summer Olympics

We separate participants by season and group them by region to determine the number of medals each region has won in each season.
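As a rough sketch of this step, assuming the merged dataframe has `region`, `Season` and `Medal` columns (toy rows below), the medal counts per region and season can be tabulated like this:

```python
import pandas as pd

events = pd.DataFrame({
    "region": ["Canada", "Canada", "Russia", "USA", "USA", "Kenya"],
    "Season": ["Winter", "Winter", "Winter", "Summer", "Summer", "Summer"],
    "Medal":  ["Gold", None, "Silver", "Gold", "Bronze", None],
})

# Keep only the rows where a medal was actually won.
medalists = events.dropna(subset=["Medal"])

# Count medal rows per (Season, region), then pivot seasons into columns.
medals_by_season = (
    medalists.groupby(["Season", "region"])
    .size()
    .unstack("Season", fill_value=0)
)
print(medals_by_season)
```

Because each row is one athlete in one event, this counts a team medal once per team member; whether that is the desired definition of "medals won" is a design choice.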

The heat map of the number of medals by country in the Winter season shows a clear trend. Countries with colder climates (Canada, Russia) perform much better than other countries, while countries close to the equator, which typically have warmer climates, perform much worse in the Winter Olympics. We can infer that countries with colder climates are able to practice and train for Winter sports far more than countries with warm climates, which lets them perform better during the Winter Olympics. For example, athletes from Canada and Russia can train in skiing and snowboarding for much of the year, while athletes from warmer countries can only train in these sports when the weather allows them to. We can expect countries with colder climates to perform well in future Winter Olympic Games.

The heat map of the number of medals by country in the Summer season shows a different trend from the Winter heat map. The top-performing countries (USA, Russia, Germany) have different climates, but they are all developed countries with large populations. We can infer that climate matters less here because many popular Summer events (for example Gymnastics and Swimming) can be held indoors, so athletes from countries with cold climates can still practice and train in these sports. Infrastructure in developed countries is much better than in underdeveloped countries, so the training facilities available to their athletes allow them to perform better than athletes from underdeveloped countries. We can expect countries that are developed and have large populations to perform best in future Summer Olympic Games.

Impact of GDP on countries' performance

Using web scraping, we obtain the GDP of the top countries and merge it with the Olympics dataset to get a new dataframe containing the GDP and the number of medals for each country.

Many countries have started programs of higher sports funding to improve their athletes' performance. Two factors that might affect a country's performance are its population and its GDP. Countries with a higher GDP can allocate more resources, better training and appropriate infrastructure to their athletes. This could improve an athlete's performance and so raise the chances of winning medals. We found a considerable correlation between a country's GDP and the number of medals it has won at the Olympics: nations with a higher GDP tend to have higher medal counts too.

The above line graph shows GDP for 20 countries. The values vary widely, as the economic status of every country is different. The bar graph shows the number of medals won by each country. We observe that for most countries the number of medals won rises with GDP. The USA, the country with the highest GDP, also has the highest number of medals. There are cases where a country has won a considerable number of medals even though its GDP is not among the highest. For example, Germany has a very large number of medal winners, but that number is out of proportion to its GDP. This could be because, despite its comparatively lower GDP, Germany sends many athletes to every Games, which increases its probability of winning.

Comparing two nations, India and Canada, the GDP of India is higher than that of Canada, yet Canada has more medals than India. Canada is a highly developed nation with one of the largest economies in the world, whereas India is still a developing nation. Differences of this kind could be one reason why some countries in the above graph are outliers.

There is therefore a clear connection between the GDP of a country and its performance at the Olympics. However, this relationship is not fixed, as many other factors play a major role in determining the performance of a nation.
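One way to put a number on the GDP and medal relationship discussed in this section is a correlation coefficient. The sketch below is only an illustration: the countries "A" to "D", the column names `medals` and `gdp_billion_usd`, and all values are hypothetical placeholders, not figures from the scraped table.

```python
import pandas as pd

# Hypothetical medal totals and GDP values for four placeholder countries.
medal_totals = pd.DataFrame({
    "region": ["A", "B", "C", "D"],
    "medals": [100, 60, 50, 10],
})
gdp = pd.DataFrame({
    "region": ["A", "B", "C", "D"],
    "gdp_billion_usd": [2000, 1000, 1500, 500],
})

# Join the two tables on the shared country column.
combined = medal_totals.merge(gdp, on="region", how="inner")

# Pearson correlation measures the strength of a linear relationship.
pearson = combined["medals"].corr(combined["gdp_billion_usd"])

# Correlating the ranks instead (Spearman's idea) is less sensitive to outliers.
rank_corr = combined["medals"].rank().corr(combined["gdp_billion_usd"].rank())

print(f"Pearson: {pearson:.3f}, rank correlation: {rank_corr:.3f}")
```

A coefficient close to 1 would support the "higher GDP, more medals" reading, while outliers of the kind noted above pull the value down.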

Impact of number of participants on the success rate of USA

To find the success rate of participants in each year, we select the USA, as it has the highest number of medals and participants and therefore gives us better data to analyze.

It is evident that as the number of participants increases, the number of medals increases. However, this affects the success rate of the participants. From the above graph we can see that in most cases, when the number of participants is high, as in 1904, 1988 and 1992, the success rate decreases. When the number of participants is low or moderate, the success rate is high, as in 2008, 2010, 2012 and 2016. A possible explanation is that with fewer participants it is easier to focus on each of them and give everyone better facilities. This helps participants feel more valued and gives them more motivation to perform better, which improves their chance of winning a medal.
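The success rate discussed above is not defined explicitly in the text. The sketch below assumes one plausible definition, medal-winning entries divided by distinct participants per year, and uses made-up rows with assumed column names `Year`, `ID` and `Medal`.

```python
import pandas as pd

# Made-up rows standing in for the USA subset of the athlete data.
usa = pd.DataFrame({
    "Year":  [1904, 1904, 1904, 1904, 2016, 2016],
    "ID":    [1, 2, 3, 4, 5, 6],
    "Medal": ["Gold", None, None, None, "Silver", "Gold"],
})

per_year = usa.groupby("Year").agg(
    participants=("ID", "nunique"),   # distinct athletes in that year
    medals=("Medal", "count"),        # count() skips missing values
)

# Assumed definition: medals won per participating athlete.
per_year["success_rate"] = per_year["medals"] / per_year["participants"]
print(per_year)
```

With this definition, an athlete who wins several medals can push the rate above 1, so counting distinct medal-winning athletes instead may be preferable depending on the question.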

Impact of BMI on Winning for each sport

We compare the BMI of participants who won a medal with the BMI of participants who did not, for each sport.

While BMI does not measure body fat directly, it correlates fairly closely with direct measures of body fat and can therefore serve as an alternative to them. BMI is the ratio of a person's weight (in kilograms) to the square of their height (in meters). The heat map shows that for the majority of sports, the BMI of medal winners is higher than the BMI of non-winners. We can also see that for some sports, such as Rhythmic Gymnastics, the BMI of winners is lower than the BMI of non-winners.

In sports that require more physical strength, such as Weightlifting and Rugby, heavier athletes have an advantage over lighter ones, so BMI is higher in these sports. In sports that require more flexibility and grace, such as Gymnastics, athletes need to be lighter, so BMI is lower. There are some exceptions, such as Tug of War, in which the weight of the athletes seems to be a significant factor in deciding the winning team, yet strategy is equally important in predicting a winner.
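The BMI comparison in this section can be sketched as follows. The example assumes `Height` is stored in centimeters and `Weight` in kilograms, and that a missing `Medal` means the athlete did not win; the rows are illustrative.

```python
import pandas as pd

athletes = pd.DataFrame({
    "Sport":  ["Weightlifting", "Weightlifting", "Gymnastics", "Gymnastics"],
    "Height": [170.0, 170.0, 160.0, 160.0],   # centimeters (assumed unit)
    "Weight": [86.7, 72.25, 51.2, 56.32],     # kilograms (assumed unit)
    "Medal":  ["Gold", None, "Silver", None],
})

# BMI = weight in kg divided by the square of height in meters.
athletes["BMI"] = athletes["Weight"] / (athletes["Height"] / 100) ** 2

# Flag medal winners, then compare mean BMI of winners and non-winners per sport.
athletes["Won"] = athletes["Medal"].notna()
mean_bmi = athletes.groupby(["Sport", "Won"])["BMI"].mean().unstack("Won")
print(mean_bmi)
```

The resulting table has one row per sport and one column each for non-winners (`False`) and winners (`True`), which is the shape a heat map of the two groups would need.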

Effects of International Politics on Olympics

Our objective here was to compare the number of athletes participating in the Olympics over the years. As the Olympics gained popularity and many countries gained independence, we expected a linear or exponential increase in the number of participating athletes. However, that was not the case, so we analyzed the trend and its causes further. Comparing the number of medals with the number of participating athletes indicates whether a drop in participants is due to a sport no longer being recognized by the Olympics or to some other reason.

From the line chart, we can see that there were no Olympics in 1916. This is because of World War I. Similarly, the 1940 and 1944 Olympics were cancelled due to World War II.
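Gaps like these can also be found programmatically. The sketch below counts distinct athletes per year and lists the years of the four-year cycle that are absent from the data. The rows are toy data with assumed column names `Year` and `ID`; on the real dataset, filtering to one season first would likely be needed so that Summer and Winter cycles are not mixed.

```python
import pandas as pd

# Toy rows: athlete 2 competes in two Games, and 1916 is absent.
events = pd.DataFrame({
    "Year": [1908, 1908, 1912, 1912, 1912, 1920, 1920],
    "ID":   [1, 2, 2, 3, 4, 5, 6],
})

# Number of distinct athletes in each Games year.
athletes_per_year = events.groupby("Year")["ID"].nunique()

# Years that the four-year cycle predicts but the data does not contain.
first_year = int(athletes_per_year.index.min())
last_year = int(athletes_per_year.index.max())
missing_years = [
    year for year in range(first_year, last_year + 1, 4)
    if year not in athletes_per_year.index
]

print(athletes_per_year)
print("Missing Games years:", missing_years)
```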

After World War II, no major conflict led to the cancellation of the Olympics. However, there were boycotts of the Olympics by some countries for political reasons. In total, there have been 6 instances in which a few countries decided not to participate because of political reasons.

The Olympics that were boycotted were those of 1956, 1964, 1976, 1980, 1984 and 1988. The reasons are as follows:

Conclusion

Our overall goal was to analyze the data and create visualizations to explore the key factors that affect countries’ performance in the Olympic Games. We aimed to analyze which factors were most important to winning, and find any trends that may allow us to predict which countries will perform best in the future Olympic Games. While performing our analysis, we also found that there were large gaps between some years that can be attributed to external political effects. Based on our analysis, we found there are several key indicators that lead to strong performance in the Olympic Games such as:

References: